accuracy_score (classification accuracy)#
Accuracy is the simplest classification metric: it’s the fraction of samples you got exactly right.
You will learn:
the math definition (binary, multiclass, multilabel)
a from-scratch NumPy implementation
how decision thresholds change accuracy
how to use accuracy when training a simple logistic regression model
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score as sk_accuracy_score
from sklearn.model_selection import train_test_split
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
Prerequisites#
You know what labels \(y \in \{0,1\}\) (binary) or \(y \in \{1,\dots,K\}\) (multiclass) are.
You’re comfortable with the idea that a classifier may output either:
hard predictions \(\hat{y}\) (a class label), or
scores/probabilities (then you still need a rule to turn them into \(\hat{y}\)).
Notation#
True labels: \(y_1,\dots,y_n\)
Predicted labels: \(\hat{y}_1,\dots,\hat{y}_n\)
Indicator function: \(\mathbf{1}[\text{statement}]\) equals 1 if the statement is true, else 0.
Definition#
Generic (binary or multiclass)#
Accuracy is the average of “correct?” indicators:
\[
\operatorname{Acc}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[\hat{y}_i = y_i]
\]
With per-sample weights \(w_i \ge 0\):
\[
\operatorname{Acc}_w(y, \hat{y}) = \frac{\sum_{i=1}^{n} w_i \, \mathbf{1}[\hat{y}_i = y_i]}{\sum_{i=1}^{n} w_i}
\]
Binary (via confusion matrix)#
If you define true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):
\[
\operatorname{Acc} = \frac{TP + TN}{TP + TN + FP + FN}
\]
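As a sanity check on this identity, a minimal NumPy sketch (toy labels, assumed for illustration) computes accuracy from the four confusion-matrix counts and compares it to the direct mean-of-matches form:

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1])

# Confusion-matrix counts for the positive class "1"
TP = np.sum((y_true == 1) & (y_pred == 1))  # 2
TN = np.sum((y_true == 0) & (y_pred == 0))  # 2
FP = np.sum((y_true == 0) & (y_pred == 1))  # 1
FN = np.sum((y_true == 1) & (y_pred == 0))  # 1

acc_from_counts = (TP + TN) / (TP + TN + FP + FN)
acc_direct = (y_true == y_pred).mean()
print(acc_from_counts, acc_direct)  # both 4/6 ≈ 0.667
```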
Relation to 0–1 loss#
Define the 0–1 loss per sample:
\[
\ell_{0\text{-}1}(y_i, \hat{y}_i) = \mathbf{1}[\hat{y}_i \ne y_i]
\]
Then:
\[
\operatorname{Acc} = 1 - \frac{1}{n} \sum_{i=1}^{n} \ell_{0\text{-}1}(y_i, \hat{y}_i)
\]
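The identity is easy to verify numerically; a short sketch with made-up labels:

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])

loss_01 = (y_true != y_pred).astype(float)  # 1 for each mistake, 0 otherwise
print(1 - loss_01.mean())         # 0.6
print((y_true == y_pred).mean())  # 0.6 — the same number
```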
Multilabel (subset accuracy)#
If each sample has a vector of labels \(\mathbf{y}_i \in \{0,1\}^L\), scikit-learn’s accuracy_score uses subset accuracy:
\[
\operatorname{Acc}_{\text{subset}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[\mathbf{y}_i = \hat{\mathbf{y}}_i]
\]
A sample counts as correct only if all labels match exactly.
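A tiny sketch (toy label matrices, assumed) shows how strict this is: a single wrong label zeroes out the entire sample:

```python
import numpy as np

Y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
Y_pred = np.array([[1, 0, 1],    # all three labels match -> sample is correct
                   [0, 1, 1]])   # one label off          -> sample is wrong

subset_acc = np.all(Y_true == Y_pred, axis=1).mean()
print(subset_acc)  # 0.5, even though 5 of 6 individual labels are right
```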
Intuition: accuracy is an average of 0/1 per-sample outcomes#
Each sample contributes either:
1 (correct prediction)
0 (incorrect prediction)
Accuracy is just the mean of that vector.
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])
correct = (y_true == y_pred).astype(int)
acc = correct.mean()
print('per-sample correct:', correct)
print('accuracy:', acc)
print('sklearn accuracy:', sk_accuracy_score(y_true, y_pred))
fig = go.Figure()
fig.add_trace(
go.Bar(
x=np.arange(len(correct)),
y=correct,
marker_color=["#2ca02c" if c == 1 else "#d62728" for c in correct],
name="correct (1) / wrong (0)",
)
)
fig.add_hline(
y=acc,
line_dash="dash",
line_color="black",
annotation_text=f"accuracy = {acc:.2f}",
annotation_position="top left",
)
fig.update_layout(
title="Accuracy = mean of per-sample correctness",
xaxis_title="sample index",
yaxis_title="correct?",
yaxis=dict(tickmode="array", tickvals=[0, 1]),
)
fig.show()
per-sample correct: [1 1 0 1 1 0 1 1]
accuracy: 0.75
sklearn accuracy: 0.75
From-scratch NumPy implementation#
def accuracy_score_np(y_true, y_pred, *, sample_weight=None, normalize=True):
'''Compute accuracy (and multilabel subset accuracy) using NumPy.
Parameters
----------
y_true, y_pred:
1D arrays (n_samples,) for standard classification, or
2D arrays (n_samples, n_labels) for multilabel subset accuracy.
sample_weight:
Optional array (n_samples,) of non-negative weights.
normalize:
If True, return a fraction in [0, 1]. If False, return the (weighted) count.
'''
y_true = np.asarray(y_true)
y_pred = np.asarray(y_pred)
if y_true.shape != y_pred.shape:
raise ValueError(f"shape mismatch: y_true {y_true.shape} vs y_pred {y_pred.shape}")
if y_true.ndim == 1:
correct = (y_true == y_pred)
elif y_true.ndim == 2:
# multilabel subset accuracy: all labels must match per sample
correct = np.all(y_true == y_pred, axis=1)
else:
raise ValueError(f"expected 1D or 2D arrays, got ndim={y_true.ndim}")
if sample_weight is None:
if normalize:
return float(np.mean(correct))
return float(np.sum(correct))
w = np.asarray(sample_weight)
if w.ndim != 1 or w.shape[0] != correct.shape[0]:
raise ValueError(f"sample_weight must be shape (n_samples,), got {w.shape}")
correct_f = correct.astype(float)
if normalize:
return float(np.average(correct_f, weights=w))
return float(np.sum(w * correct_f))
def predict_labels_from_proba(p, threshold=0.5):
'''Turn probabilities into hard labels using a threshold.'''
p = np.asarray(p)
return (p >= threshold).astype(int)
# Sanity checks vs scikit-learn
# 1) Binary / multiclass (1D)
y_true = rng.integers(0, 3, size=200)
y_pred = rng.integers(0, 3, size=200)
w = rng.uniform(0.1, 2.0, size=200)
for normalize in [True, False]:
ours = accuracy_score_np(y_true, y_pred, sample_weight=w, normalize=normalize)
theirs = sk_accuracy_score(y_true, y_pred, sample_weight=w, normalize=normalize)
print(normalize, ours, theirs, 'diff', abs(ours - theirs))
# 2) Multilabel subset accuracy (2D)
y_true_ml = rng.integers(0, 2, size=(50, 4))
y_pred_ml = rng.integers(0, 2, size=(50, 4))
print('multilabel:', accuracy_score_np(y_true_ml, y_pred_ml), sk_accuracy_score(y_true_ml, y_pred_ml))
True 0.29867683921252447 0.29867683921252447 diff 0.0
False 62.63396899162787 62.63396899162787 diff 0.0
multilabel: 0.08 0.08
Accuracy depends on the decision threshold#
Many binary classifiers output a probability \(p_i = P(y_i=1\mid x_i)\). To turn that into a predicted label, you choose a threshold \(t\):
\[
\hat{y}_i(t) = \mathbf{1}[p_i \ge t]
\]
So accuracy is really a function of \(t\):
\[
\operatorname{Acc}(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[\hat{y}_i(t) = y_i]
\]
Key property: \(\operatorname{Acc}(t)\) is a step function of \(t\) (it only changes when \(t\) crosses one of the predicted probabilities).
# Synthetic probabilities where threshold choice matters
n = 200
y_true = rng.integers(0, 2, size=n)
# Create probabilities correlated with y_true, but noisy
logit = (y_true * 2 - 1) * 1.2 + rng.normal(0, 1.0, size=n)
p = 1 / (1 + np.exp(-logit))
thresholds = np.linspace(0, 1, 401)
accs = np.array([accuracy_score_np(y_true, predict_labels_from_proba(p, t)) for t in thresholds])
best_idx = int(np.argmax(accs))
best_t = float(thresholds[best_idx])
print('accuracy @ t=0.50:', accuracy_score_np(y_true, predict_labels_from_proba(p, 0.5)))
print('best threshold:', best_t)
print('best accuracy:', float(accs[best_idx]))
fig = go.Figure()
fig.add_trace(go.Scatter(x=thresholds, y=accs, mode='lines', name='accuracy(t)'))
fig.add_vline(x=0.5, line_dash='dash', line_color='gray', annotation_text='0.5', annotation_position='top')
fig.add_vline(x=best_t, line_dash='dash', line_color='black', annotation_text=f'best={best_t:.2f}', annotation_position='top')
fig.update_layout(
title='Accuracy as a function of the decision threshold',
xaxis_title='threshold t',
yaxis_title='accuracy',
yaxis=dict(range=[0, 1]),
)
fig.show()
fig = px.histogram(
x=p,
color=y_true.astype(str),
nbins=30,
barmode='overlay',
opacity=0.6,
title='Predicted probability distribution by true class',
labels={'x': 'predicted probability p', 'color': 'true label'},
)
fig.add_vline(x=best_t, line_dash='dash', line_color='black')
fig.add_vline(x=0.5, line_dash='dash', line_color='gray')
fig.show()
accuracy @ t=0.50: 0.91
best threshold: 0.4575
best accuracy: 0.92
A classic pitfall: accuracy on imbalanced data (“accuracy paradox”)#
If one class dominates, a model can achieve high accuracy by always predicting the majority class.
Example: 95% negatives, 5% positives.
Predicting “negative” for everyone gives 95% accuracy.
But it completely fails to detect the positives.
This is why it’s good practice to always look at the confusion matrix (and consider metrics like recall/precision/F1 or balanced accuracy).
# Imbalanced example: 95% of class 0
n = 200
n_pos = int(0.05 * n)
y_true = np.array([1] * n_pos + [0] * (n - n_pos))
rng.shuffle(y_true)
y_pred_all0 = np.zeros_like(y_true)
acc = accuracy_score_np(y_true, y_pred_all0)
print('majority-class baseline accuracy:', acc)
# Confusion matrix counts (binary)
TN = int(np.sum((y_true == 0) & (y_pred_all0 == 0)))
FP = int(np.sum((y_true == 0) & (y_pred_all0 == 1)))
FN = int(np.sum((y_true == 1) & (y_pred_all0 == 0)))
TP = int(np.sum((y_true == 1) & (y_pred_all0 == 1)))
cm = np.array([[TN, FP], [FN, TP]])
fig = go.Figure(
data=go.Heatmap(
z=cm,
x=['pred 0', 'pred 1'],
y=['true 0', 'true 1'],
text=cm,
texttemplate='%{text}',
colorscale='Blues',
showscale=False,
)
)
fig.update_layout(title='Confusion matrix for the majority-class baseline')
fig.show()
fig = px.bar(
x=['class 0', 'class 1'],
y=[int(np.sum(y_true == 0)), int(np.sum(y_true == 1))],
title='Class imbalance in the data',
labels={'x': 'class', 'y': 'count'},
)
fig.show()
majority-class baseline accuracy: 0.95
Using accuracy during optimization: logistic regression (NumPy)#
Why we usually don’t optimize accuracy directly#
Accuracy corresponds to the 0–1 loss, which is piecewise constant in the model parameters: its gradient is zero almost everywhere and undefined where predictions flip. Gradient-based methods (like gradient descent) need informative gradients, so we typically optimize a smooth surrogate such as log loss.
Even if you train with log loss, accuracy is still useful for:
monitoring training (does performance improve?)
comparing models
choosing a decision threshold on a validation set
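To make the flat-objective point concrete, here is a small sketch on synthetic, linearly separable data (the setup is illustrative, not from the sections above): a tiny nudge to the weights leaves accuracy unchanged, while log loss still moves, so only the latter gives gradient descent a usable signal.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # labels follow a linear rule

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def metrics(w):
    p = sigmoid(X @ w)
    acc = ((p >= 0.5).astype(int) == y).mean()
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return acc, loss

w = np.array([1.0, 1.0])
acc0, loss0 = metrics(w)
acc1, loss1 = metrics(w + 1e-4)  # tiny parameter perturbation

print(acc1 - acc0)    # 0.0 — accuracy (0–1 loss) is flat almost everywhere
print(loss1 - loss0)  # nonzero — log loss responds to the perturbation
```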
# Synthetic 2D dataset (two overlapping Gaussians)
n0, n1 = 260, 140
X0 = rng.normal(loc=(-1.0, -1.0), scale=(1.1, 1.1), size=(n0, 2))
X1 = rng.normal(loc=(1.2, 1.0), scale=(1.3, 1.0), size=(n1, 2))
X = np.vstack([X0, X1])
y = np.array([0] * n0 + [1] * n1)
# Shuffle and split
idx = rng.permutation(len(y))
X, y = X[idx], y[idx]
X_train, X_val, y_train, y_val = train_test_split(
X, y, test_size=0.35, random_state=42, stratify=y
)
fig = px.scatter(
x=X_train[:, 0],
y=X_train[:, 1],
color=y_train.astype(str),
title='Training data (2D) — overlap makes errors unavoidable',
labels={'x': 'x1', 'y': 'x2', 'color': 'class'},
)
fig.show()
def sigmoid(z):
return 1 / (1 + np.exp(-z))
def add_intercept(X):
return np.c_[np.ones((X.shape[0], 1)), X]
def log_loss_np(y_true, p, eps=1e-15):
p = np.clip(p, eps, 1 - eps)
y_true = y_true.astype(float)
return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))
def fit_logreg_gd(X, y, *, lr=0.2, n_iters=400, l2=0.0, verbose=False):
'''Logistic regression with batch gradient descent (binary).'''
Xb = add_intercept(X)
y = y.astype(float)
w = np.zeros(Xb.shape[1])
history = {
'iter': [],
'train_loss': [],
'train_acc@0.5': [],
}
for it in range(1, n_iters + 1):
z = Xb @ w
p = sigmoid(z)
# Gradient of average log loss + L2 regularization (excluding intercept)
grad = (Xb.T @ (p - y)) / Xb.shape[0]
grad[1:] += l2 * w[1:]
w -= lr * grad
if it % 5 == 0 or it == 1:
y_pred = (p >= 0.5).astype(int)
history['iter'].append(it)
history['train_loss'].append(log_loss_np(y, p))
history['train_acc@0.5'].append(accuracy_score_np(y.astype(int), y_pred))
if verbose:
print(it, history['train_loss'][-1], history['train_acc@0.5'][-1])
return w, history
w, hist = fit_logreg_gd(X_train, y_train, lr=0.15, n_iters=500, l2=0.01)
# Evaluate on validation
p_val = sigmoid(add_intercept(X_val) @ w)
acc_val_05 = accuracy_score_np(y_val, predict_labels_from_proba(p_val, 0.5))
print('validation accuracy @ 0.5:', acc_val_05)
fig = go.Figure()
fig.add_trace(go.Scatter(x=hist['iter'], y=hist['train_loss'], mode='lines', name='train log loss'))
fig.update_layout(title='Training objective (log loss) decreases smoothly', xaxis_title='iteration', yaxis_title='log loss')
fig.show()
fig = go.Figure()
fig.add_trace(go.Scatter(x=hist['iter'], y=hist['train_acc@0.5'], mode='lines', name='train accuracy @ 0.5'))
fig.update_layout(title='Accuracy during training (often changes in jumps)', xaxis_title='iteration', yaxis_title='accuracy', yaxis=dict(range=[0, 1]))
fig.show()
validation accuracy @ 0.5: 0.9142857142857143
# Pick a decision threshold that maximizes validation accuracy
thresholds = np.linspace(0, 1, 401)
accs_val = np.array([accuracy_score_np(y_val, predict_labels_from_proba(p_val, t)) for t in thresholds])
best_idx = int(np.argmax(accs_val))
best_t = float(thresholds[best_idx])
print('best threshold:', best_t)
print('validation accuracy @ best_t:', float(accs_val[best_idx]))
fig = go.Figure()
fig.add_trace(go.Scatter(x=thresholds, y=accs_val, mode='lines', name='val accuracy(t)'))
fig.add_vline(x=0.5, line_dash='dash', line_color='gray', annotation_text='0.5', annotation_position='top')
fig.add_vline(x=best_t, line_dash='dash', line_color='black', annotation_text=f'best={best_t:.2f}', annotation_position='top')
fig.update_layout(title='Validation accuracy vs threshold', xaxis_title='threshold', yaxis_title='accuracy', yaxis=dict(range=[0, 1]))
fig.show()
best threshold: 0.5375
validation accuracy @ best_t: 0.9285714285714286
# Decision boundary visualization (threshold = best_t)
x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
x1_grid = np.linspace(x1_min, x1_max, 200)
x2_grid = np.linspace(x2_min, x2_max, 200)
xx, yy = np.meshgrid(x1_grid, x2_grid)
grid = np.c_[xx.ravel(), yy.ravel()]
p_grid = sigmoid(add_intercept(grid) @ w).reshape(xx.shape)
fig = go.Figure()
# Background probability field
fig.add_trace(
go.Contour(
x=x1_grid,
y=x2_grid,
z=p_grid,
colorscale='RdBu',
reversescale=True,
opacity=0.6,
contours=dict(showlines=False),
colorbar=dict(title='P(y=1)'),
name='P(y=1)',
)
)
# Decision boundary: p = best_t
fig.add_trace(
go.Contour(
x=x1_grid,
y=x2_grid,
z=p_grid,
showscale=False,
contours=dict(start=best_t, end=best_t, size=1, coloring='lines'),
line=dict(color='black', width=3),
name='boundary',
)
)
fig.add_trace(
go.Scatter(
x=X_val[:, 0],
y=X_val[:, 1],
mode='markers',
marker=dict(size=7, color=y_val, colorscale='Viridis', line=dict(width=0.5, color='white')),
name='validation points',
)
)
fig.update_layout(
title=f'Decision boundary for threshold t={best_t:.2f}',
xaxis_title='x1',
yaxis_title='x2',
)
fig.show()
Practical usage (scikit-learn)#
For most workflows you’ll use scikit-learn:
from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)
Notes:
For multiclass, pass integer class labels.
For multilabel, accuracy_score computes subset accuracy (all labels must match).
Use sample_weight= if some samples should count more than others.
# Compare with scikit-learn's LogisticRegression
clf = LogisticRegression(max_iter=2000)
clf.fit(X_train, y_train)
y_pred_val = clf.predict(X_val)
print('sklearn LogisticRegression val accuracy:', sk_accuracy_score(y_val, y_pred_val))
sklearn LogisticRegression val accuracy: 0.9071428571428571
Pros / Cons / When to use#
Pros#
Simple and interpretable: “percent correct”.
Works well when classes are balanced and error costs are similar.
Useful as a quick baseline and sanity check.
Cons#
Misleading on imbalanced datasets (majority-class baseline can look “great”).
Hides which mistakes you make (use a confusion matrix / per-class metrics).
For probabilistic models it is threshold-dependent.
Hard to optimize directly with gradient methods (0–1 loss is non-smooth).
For multilabel, subset accuracy can be too strict.
Good fits#
Balanced multiclass problems (top-1 correctness matters).
Settings where false positives and false negatives have comparable cost.
Consider alternatives when#
Classes are imbalanced:
balanced_accuracy_score, precision/recall/F1, PR-AUC.You care about ranking/threshold tradeoffs: ROC curves, PR curves.
You need probability quality: log loss, Brier score.
Exercises#
Build a classifier that outputs probabilities and show how the best threshold shifts when class imbalance increases.
Construct two models with identical accuracy but very different confusion matrices. Which one would you deploy for a medical screening task?
For multilabel data, compare subset accuracy with per-label accuracy (Hamming accuracy).
References#
scikit-learn accuracy_score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
scikit-learn classification metrics user guide: https://scikit-learn.org/stable/modules/model_evaluation.html